INTRODUCTION


One of the problems commonly encountered in pattern 
recognition is the selection of effective features from a given set of 
measurements. The use of a large number of feature measurements 
increases the complexity of the classifier and the computer time it 
requires (Swain, 1972). For example, in remote sensing of earth 
resources and environment, the problem reduces to the following: given 
a set of N features (e.g. multispectral scanner channels), find a subset 
consisting of n channels which provides an optimal trade-off between 
classification cost and classification accuracy (Fu, 1970). The SKYLAB 
multispectral scanner (S192), for instance, has 13 channels, and an 
analyst generally wants to use the best four or five of these channels 
for classification. 
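
To give a sense of the size of this search, the number of distinct n-channel subsets of N channels is the binomial coefficient C(N, n). The following is a minimal sketch for the 13-channel case; the function name `count_subsets` is illustrative and does not come from the cited works.

```python
from math import comb

def count_subsets(total_channels: int, subset_size: int) -> int:
    """Number of distinct channel subsets of the given size."""
    return comb(total_channels, subset_size)

# Candidate subsets that an exhaustive search would have to score
for n in (4, 5):
    print(n, count_subsets(13, n))
```

Each of these subsets would need its own error estimate, which is why cheaper separability measures are used in place of the error probability itself.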

The effectiveness of the features should be determined by 
the performance of the recognition system, usually in terms of the 
probability of correct recognition. Ideally, one would like to solve this 
problem by computing the probability of misclassification associated with 
each n-feature subset and then selecting the one giving the best 
performance (Swain, 1972). However, it is generally not feasible to 
perform the required computations. Even when one assumes normal 
distributions, numerical integration is required which, in the 
multidimensional case, is impractical to carry out. Some of the 
techniques of feature selection are summarized below. 

FEATURE SELECTION TECHNIQUES


Fu (1970) used a nonparametric feature selection 
technique based on the direct estimation of error probability from 
samples. The maximum likelihood decision rule (MLDR) was used for 
classification. He pointed out that a large amount of computation time 
is required, especially when the number of classes is large. Using 7530 
test samples, he applied the proposed nonparametric method of feature 
selection to crop classification. The results of his experiment are 
given in Table I. He found that all the classes are separable for most 
4-feature subsets (41 sets). 


TABLE I 

RESULTS OF NONPARAMETRIC FEATURE SELECTION TECHNIQUE 

             NONPARAMETRIC METHOD             PARAMETRIC METHOD
NUMBER OF    BEST FEATURE       PERCENT       BEST FEATURE        PERCENT
FEATURES     SET                ERROR         SET                 ERROR

1            X,                 33.8          Xg                  37.6
2            X1, Xg             3.1           X1, Xg              10.6
3            X1, X10, X,        0.1           X1, X10, X11        5.0
4            X, , Xy, X12       0.0           X1, Xe, X10, X11    4.9
             (41 feature sets)


Many authors have studied linear feature-space 
transformation techniques for the feature selection problem. For 
example, Watanabe (1966) introduced the feature-space compression 
technique based on the Karhunen-Loeve (K-L) expansion. Fu (1971) tested 
a feature selection technique based on the generalized K-L expansion on 
crop classification. The results were compared with those obtained using 
the parametric feature selection technique. The MLDR was used for the 
classifier, and the appropriate statistical parameters were estimated 
from training samples for each class. He found that the transformed 
p-dimensional feature space was less effective than the feature subspace 
of the same dimension for all p (p ≤ N, the total number of available 
features), but the difference in performance for the 4-feature subset 
was only 1.3 percent. The computation time required, on the other hand, 
was much shorter for the transformation technique. 
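
As an illustration of the transformation idea, the sketch below projects N-channel samples onto the p leading eigenvectors of their sample covariance matrix. This is the ordinary K-L expansion, not the generalized version tested by Fu (1971), and the synthetic data, the function name `kl_transform` and the use of NumPy are assumptions made for the example.

```python
import numpy as np

def kl_transform(samples: np.ndarray, p: int) -> np.ndarray:
    """Project samples (rows) onto the p leading eigenvectors of their
    sample covariance matrix (ordinary K-L expansion)."""
    centered = samples - samples.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1][:p]  # indices of the p largest eigenvalues
    return centered @ eigenvectors[:, order]

# Synthetic 12-channel data with a different spread in each channel
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 12)) * np.arange(1, 13)
reduced = kl_transform(data, 4)
print(reduced.shape)
```

The transformed features are linear combinations of all N channels, so every channel still has to be measured; only the classifier works in the reduced space.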

An intermediate quantity which is related to the 
classification accuracy is often used as a basis for feature selection 
(Fu, 1971). Divergence between pattern classes has been proposed as a 
criterion for feature selection. 

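Divergence is defined for any two density functions. In 
the case of normal variables with unequal covariance matrices, it can be 
shown (Kailath, 1967) that 

$$D(i, j \mid c_1, c_2, \ldots, c_p) = \tfrac{1}{2}\,\mathrm{tr}\left[(\Sigma_i - \Sigma_j)(\Sigma_j^{-1} - \Sigma_i^{-1})\right] + \tfrac{1}{2}\,\mathrm{tr}\left[(\Sigma_i^{-1} + \Sigma_j^{-1})(U_i - U_j)(U_i - U_j)^T\right] \tag{1}$$

where $U_i$ and $\Sigma_i$ are the mean vector and covariance matrix of 
class i, computed over the p features $c_1, \ldots, c_p$. 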
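
The divergence between two Gaussian classes can be evaluated directly from their means and covariance matrices. The sketch below assumes the usual closed form (a covariance trace term plus a mean-difference trace term, each weighted by 1/2); the function name `divergence` and the use of NumPy are choices made for this example.

```python
import numpy as np

def divergence(mean_i, cov_i, mean_j, cov_j) -> float:
    """Divergence between two Gaussian classes (assumed closed form)."""
    mean_i = np.atleast_1d(mean_i).astype(float)
    mean_j = np.atleast_1d(mean_j).astype(float)
    cov_i = np.atleast_2d(cov_i).astype(float)
    cov_j = np.atleast_2d(cov_j).astype(float)
    inv_i, inv_j = np.linalg.inv(cov_i), np.linalg.inv(cov_j)
    diff = (mean_i - mean_j).reshape(-1, 1)
    # Term driven by the difference between the covariance matrices
    covariance_term = 0.5 * np.trace((cov_i - cov_j) @ (inv_j - inv_i))
    # Term driven by the difference between the mean vectors
    mean_term = 0.5 * np.trace((inv_i + inv_j) @ diff @ diff.T)
    return float(covariance_term + mean_term)

# Two 2-channel classes with identity covariance, means two units apart
print(divergence([0, 0], np.eye(2), [2, 0], np.eye(2)))
```

With equal covariance matrices the first term vanishes and the divergence reduces to the squared distance between the means, weighted by the inverse covariance.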

It can also be shown that the probability of 
misclassification is a monotonically decreasing function of divergence. 
Therefore, features selected according to the magnitude of divergence 
reflect their discriminatory power between the classes i and j. In other 
words, feature set $\alpha_p$ is considered more effective than feature 
set $\alpha_q$ if $D(i, j \mid \alpha_p) > D(i, j \mid \alpha_q)$ 
(Fu, 1970). Divergence is a distance measure between the two statistical 
distributions. It is an indirect measure of the ability of the 
classifier to successfully discriminate between them. 

Fu (1970) assumed that the feature vectors for each class 
were Gaussianly distributed. He used the linear classification procedure 
based on the maximum likelihood decision rule (MLDR) for the multiclass 
classification problem by minimizing the maximum probability of overall 
misclassification (minimax procedure, Anderson and Bahadur). 

He showed that a monotonic functional relationship exists 
between the probability of pairwise misclassification between the 
classes and the separability measure. In addition, he showed that in the 
case of Gaussianly distributed pattern classes with equal covariance 
matrices, the divergence and the separability measure have a monotonic 
relationship. Nevertheless, it is clear that the separability measure is 
a more general criterion for feature effectiveness. He tested the 
effectiveness of the feature sets by computing the percentage of 
misclassification with 7530 test samples (approximately 1500 samples per 
class) classified by the MLDR classifier, and then selected the optimum 
feature sets from all possible combinations. He found, from experimental 
results, that it is possible for smaller feature subsets to be almost as 
effective as the complete feature set. Thus, in many situations, 
selecting the optimum feature subset considerably reduces the computer 
time required for classification, as compared to using the entire 
feature set, with a relatively small loss of classification accuracy. 

Although divergence only provides a measure of the 
distance between two class densities, its use is extended to the 
multiclass case by taking the average over all class pairs (Fu, 1971). 
If $D(i, j)$ is the divergence between classes i and j, then the 
multiclass feature selection criterion is 

$$D_{AVG} = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} D(i, j) \tag{2}$$

where m is the number of classes. 
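
The averaging over class pairs can be sketched as follows. The three pairwise divergence values are hypothetical, chosen only to show how one widely separated pair dominates the average; the function name is likewise illustrative.

```python
from itertools import combinations

def average_pairwise_divergence(classes, pairwise):
    """Mean of the divergence over all unordered class pairs."""
    m = len(classes)
    total = sum(pairwise[frozenset(pair)] for pair in combinations(classes, 2))
    return 2.0 * total / (m * (m - 1))

# Hypothetical pairwise divergences for three classes A, B and C
pairwise = {
    frozenset("AB"): 10.0,
    frozenset("AC"): 200.0,
    frozenset("BC"): 30.0,
}
print(average_pairwise_divergence(["A", "B", "C"], pairwise))
```

Here the average is 80.0 even though two of the three pairs have divergences of 30 or less.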



Another strategy is to maximize the minimum pairwise 
divergence (Grettenberg, 1963; Fu and Chen, 1969; Kadota and Shepp, 
1967; Swain, 1972), i.e., to select the feature combination which does 
the best job of separating the "hardest to separate" pair of classes. 
For example, consider a situation where there are 3 classes A, B and C: 

$$D_{MIN} = \min(D_{AB}, D_{AC}, D_{BC}) \tag{3}$$

where $D_{AB}$ is the divergence between class A and class B, and 
similarly for the other pairs. 
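
A minimal sketch of the max-min strategy for the three-class case is given below. The channel names and divergence values are hypothetical and serve only to show that the subset with the larger minimum pairwise divergence is preferred, even when another subset has a larger average.

```python
def min_pairwise_divergence(pairwise):
    """Divergence of the hardest-to-separate class pair."""
    return min(pairwise.values())

# Hypothetical pairwise divergences for two candidate channel subsets
candidates = {
    ("ch1", "ch3"): {"AB": 12.0, "AC": 150.0, "BC": 40.0},
    ("ch2", "ch4"): {"AB": 25.0, "AC": 60.0, "BC": 30.0},
}

# Max-min rule: keep the subset whose worst pair is best separated
best = max(candidates, key=lambda subset: min_pairwise_divergence(candidates[subset]))
print(best)
```

In this example the first subset has the larger average divergence, but its classes A and B are harder to separate, so the max-min rule selects the second subset.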


The relationship between divergence and classification 
accuracy is highly nonlinear (in fact, divergence increases without 
bound as the class separability increases, whereas the probability of 
correct classification must saturate at 100 percent), and it is found 
that widely separable classes make too large a contribution to 
$D_{AVG}$ as compared with less separable classes. As a result, in 
problems involving a wide range of class separabilities, $D_{AVG}$ is 
not a reliable criterion for feature selection. 


On the other hand, $D_{MIN}$ is based on selecting the 
channels which do the best job of separating the hardest-to-separate 
pair of classes. Although this is certainly a reasonable strategy in 
many remote sensing of earth resources problems, there is no guarantee 
that it is the optimal one. 


As pointed out before, as the separability of a pair of 
classes increases, the pairwise divergence also increases without limit 
but the probability of correct classification saturates at 100 percent. 
A modified form of divergence, referred to as the "transformed 
divergence", $D_T$, has a behavior more like the probability of correct 
classification than divergence (Swain and King, 1973): 

$$D_T = 2\left[1 - \exp(-D/8)\right] \tag{4}$$

where D is the divergence discussed above. The saturating behavior of 
this function reduces the effects of widely separated classes when 
taking the average over all pairwise separations. $D_{TAVG}$, based on 
transformed divergence, has been found a much more reliable criterion for 
feature selection than $D_{AVG}$, based on "ordinary" divergence. 
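
The saturating behavior can be checked numerically. The sketch below assumes that the transformed divergence has the form 2[1 - exp(-D/8)], as commonly attributed to Swain and King (1973); the function name is illustrative.

```python
from math import exp

def transformed_divergence(d: float) -> float:
    """Transformed divergence, assuming the form 2 * (1 - exp(-D / 8))."""
    return 2.0 * (1.0 - exp(-d / 8.0))

# Ordinary divergence grows without bound; the transformed value levels off
for d in (1.0, 10.0, 100.0, 1000.0):
    print(d, round(transformed_divergence(d), 4))
```

Under this assumed form the value never exceeds 2, so a single widely separated class pair cannot dominate an average taken over all pairs.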

Swain et al. (1971) have shown experimentally that a 
separability measure referred to as the B-distance, based on 
Bhattacharyya's coefficient, is a much more reliable criterion than 
divergence, presumably because, as a function of class separability, it 
behaves more like the probability of correct classification. For two 
densities $p_1(x)$ and $p_2(x)$, the B-distance is given by 



$$B = \left\{ \int_x \left[\sqrt{p_1(x)} - \sqrt{p_2(x)}\right]^2 dx \right\}^{1/2} \tag{5}$$
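
For one-dimensional Gaussian densities the B-distance can be approximated by numerical integration. The sketch below assumes that the B-distance is the square root of the integral of the squared difference between the square roots of the two densities; the integration limits, step count and function names are choices made for this example.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2.0 * pi))

def b_distance(mean1, std1, mean2, std2, steps=20000):
    """Integrate [sqrt(p1) - sqrt(p2)]^2 by the midpoint rule, then take the root."""
    lo = min(mean1 - 10 * std1, mean2 - 10 * std2)
    hi = max(mean1 + 10 * std1, mean2 + 10 * std2)
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * width
        gap = sqrt(normal_pdf(x, mean1, std1)) - sqrt(normal_pdf(x, mean2, std2))
        total += gap * gap
    return sqrt(total * width)

# The distance levels off as the class means move apart
for separation in (0.0, 1.0, 2.0, 5.0, 50.0):
    print(separation, round(b_distance(0.0, 1.0, separation, 1.0), 4))
```

Under this assumed form the distance is bounded by the square root of 2, which is consistent with the saturating behavior described in the text.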


Swain and King (1973) performed an experiment to compare 
the separability measures divergence, transformed divergence and 
B-distance. Based on typical second order statistics derived from real 
remote sensing data, 2790 sets of Gaussianly distributed artificial data 
were generated; each set contained 1000 observations for each of two 
pattern classes in a feature space of dimensionality ranging from 1 to 6 
(465 sets were generated for each dimension 1, 2, ..., 6). For each set, 
the divergence, transformed divergence and B-distance were computed, and 
the actual classification error for the 2000 observations was taken as 
the associated probability of error. They found that both transformed 
divergence and B-distance are much better measures for feature 
selection than divergence. In addition, B-distance was found to be a 
slightly better measure for feature selection than the transformed 
divergence. 
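
A much reduced version of such an experiment can be sketched in one dimension. The example below is not the procedure of Swain and King (1973): it assumes two classes with known means, a common variance and equal priors, so that the maximum likelihood boundary is the midpoint between the means, and it assumes the form 2[1 - exp(-D/8)] for the transformed divergence.

```python
import random
from math import exp

def simulate_error(mean_a, mean_b, std, n_per_class=1000, seed=0):
    """Fraction of simulated observations misclassified (requires mean_a < mean_b)."""
    rng = random.Random(seed)
    threshold = 0.5 * (mean_a + mean_b)  # ML boundary for equal variances and priors
    errors = 0
    for _ in range(n_per_class):
        errors += rng.gauss(mean_a, std) >= threshold  # class A sample called B
        errors += rng.gauss(mean_b, std) < threshold   # class B sample called A
    return errors / (2 * n_per_class)

def transformed_divergence(d):
    # Assumed form of the transformed divergence
    return 2.0 * (1.0 - exp(-d / 8.0))

for separation in (0.5, 1.0, 2.0, 4.0):
    d = separation ** 2  # divergence of two unit-variance classes
    print(separation, d, round(transformed_divergence(d), 3),
          simulate_error(0.0, separation, 1.0))
```

Plotting the simulated error against each separability measure over many such sets is what allows the measures to be compared as predictors of classification accuracy.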


COMPARISON OF FEATURE SELECTION TECHNIQUES 

For our comparative study, aircraft multispectral 
scanner (MSS) data over six selected flightlines were analysed in 
subsets of one to twelve spectral channels covering the visible, near 
infrared, middle infrared and thermal infrared wavelength regions. The 
data of these flightlines were of good quality and free from problems 
such as lack of sufficient ground observations, excessive cloud cover, 
excessive sun angle effects, etc. Black and white aerial photography and 
gray scale print-outs of the flightlines in the spectral channels were 
used to aid in locating the boundaries of the agricultural fields. 
A sufficient number of fields of each agricultural cover were carefully 
selected so that they could be assumed to be representative of the 
flightline. 

Transformed divergence, defined in eq. (4), was used 
throughout this study. 

Let $D_{TAVG}$ and $D_{TMIN}$ denote the average transformed 
divergence and the minimum transformed divergence, computed over all 
possible pairs of classes (each agricultural cover was treated as a 
separate class). Assuming a multivariate Gaussian distribution for each 
class, the feature selection algorithm was used to select the best 
combinations of one to eleven spectral channels out of the twelve 
available spectral channels, using each of the following criteria of 
feature selection based on the values of $D_{TAVG}$: 

1. Select the best subset of n (n = 1 to 11) spectral channels as 
   being the one that maximizes $D_{TAVG}$ by exhaustive search of all 
   possible combinations of n spectral channels out of the 12 
   available channels. 

2. Select the best subset of n (n = 1 to 11) spectral channels 
   using "forward feature selection". In forward feature selection, 
   the best individual channel is selected on the first round, and 
   then the best pair including the best one channel is selected, 
   etc. 

3. Select the best subset of n (n = 1 to 11) spectral channels using 
   "backward feature selection". This method is a counterpart to 
   forward feature selection, consisting of a sequential rejection 
   procedure, in which one finds the "best" set of features by 
   finding a set of (N-1) features discarding the worst one, then 
   choosing the best set of (N-2) among the preceding (N-1) 
   selected features, etc. 
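
The three search strategies listed above can be sketched for any criterion that scores a channel subset. The toy criterion below (individual channel scores minus a penalty for redundant pairs) is hypothetical and merely stands in for the average transformed divergence; all function and variable names are illustrative.

```python
from itertools import combinations

def exhaustive_search(channels, n, criterion):
    """Score every n-channel subset and keep the best one."""
    return max(combinations(channels, n), key=criterion)

def forward_selection(channels, n, criterion):
    """Start empty; at each round add the channel that helps most."""
    selected = []
    while len(selected) < n:
        remaining = [c for c in channels if c not in selected]
        best = max(remaining, key=lambda c: criterion(tuple(selected + [c])))
        selected.append(best)
    return tuple(sorted(selected))

def backward_selection(channels, n, criterion):
    """Start with all channels; at each round discard the least useful one."""
    selected = list(channels)
    while len(selected) > n:
        worst = max(selected,
                    key=lambda c: criterion(tuple(x for x in selected if x != c)))
        selected.remove(worst)
    return tuple(sorted(selected))

# Hypothetical criterion: channel scores minus penalties for redundant pairs
SCORES = {1: 5.0, 2: 4.0, 3: 4.0, 4: 1.0}
PENALTY = {frozenset((1, 2)): 3.0, frozenset((1, 3)): 3.0}

def toy_criterion(subset):
    gain = sum(SCORES[c] for c in subset)
    overlap = sum(PENALTY.get(frozenset(pair), 0.0)
                  for pair in combinations(subset, 2))
    return gain - overlap

channels = [1, 2, 3, 4]
for search in (exhaustive_search, forward_selection, backward_selection):
    result = search(channels, 2, toy_criterion)
    print(search.__name__, result, toy_criterion(result))
```

Exhaustive search evaluates every subset, whereas the two sequential methods evaluate only a few candidates per round. The sequential methods can therefore miss the best subset, as forward selection does in this toy case.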

From the values of the average transformed divergence 
$D_{TAVG}$, the probability of correct classification ($P_c$) was 
estimated using the curve of Swain and King (1973). Table II compares 
the values of $P_c$ obtained by exhaustive search, forward feature 
selection and backward feature selection. It shows that forward feature 
selection gives almost as good results as the exhaustive search. Data of 
more flightlines are being analysed to check these results. 

Although comparisons of feature selection techniques have 
been done and reported by many authors in the past, the present analysis 
is the first, as far as the author knows, to be done systematically on a 
large quantity of good quality earth resources data, covering the 
visible, near infrared, middle infrared and thermal infrared portions of 
the spectrum. 

The author gratefully acknowledges the Laboratory for 
Applications of Remote Sensing, Purdue University, for their permission 
to use the multispectral scanner data, obtained under NASA Grant No. 
NGL 15-005-112; Dr. Celso de Renna e Souza for his continuous 
encouragement and assistance; and Dr. Nelson de Jesus Parada, the 
Director of the Instituto de Pesquisas Espaciais (INPE), for his 
permission to publish this work. 

TABLE II 

COMPARISON OF FEATURE SELECTION TECHNIQUES 

NUMBER OF CHANNELS   $P_c$: EXHAUSTIVE   $P_c$: FORWARD FEATURE   $P_c$: BACKWARD FEATURE
IN THE SUBSET        SEARCH              SELECTION                SELECTION

 1                   84.84               84.84                    34.84
 2                   90.16               90.16                    87.71
 3                   92.59               92.28                    90.39
 4                   94.38               94.11                    91.38
 5                   95.35               95.35                    92.87
 6                   95.93               95.88                    93.12
 7                   96.26               96.26                    93.85
 8                   96.54               96.49                    94.12
 9                   96.73               96.68                    95.47
10                   96.85               96.86                    96.48
11                   96.92               96.92                    96.74

NOTE: $P_c$ denotes the probability of correct classification 
estimated from the values of average transformed divergence 
using the curve of Swain and King (1973). 